Features in text? And how to do text feature selection?
What is text clustering?
What are the applications?
How to cluster text data?
Feature selection is the process of selecting a subset of the terms in the training set and using only that subset in the classification algorithm.
High dimensionality of text features
Select the most informative features for model training
Reduce noise in feature representation
Improve final classification performance
Improve training/testing efficiency
Lower time complexity
Less training data required
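As a concrete illustration of these benefits, one very simple selection rule (a toy baseline, not a method prescribed in the slides; the thresholds below are illustrative assumptions) filters terms by document frequency, dropping terms that are too rare to generalize or too common to discriminate:

```python
from collections import Counter

def select_by_doc_freq(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor too common.

    docs: list of term lists; min_df / max_df_ratio are illustrative
    thresholds, not values prescribed by the slides.
    """
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # document frequency: one count per doc
    n = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [["a", "b"], ["a", "c"], ["a", "d"], ["b", "c"]]
selected = select_by_doc_freq(docs)  # "d" is dropped as too rare
```

Shrinking the vocabulary this way directly reduces dimensionality before any model-based scoring (such as the gini-index below) is applied.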
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)=1}\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \([1/k, 1]\): it equals \(1/k\) when the term is evenly distributed across all \(k\) classes, and \(1\) when all documents containing the term belong to a single class.
Higher values of the gini-index indicate a greater discriminative power of the term \(t\).
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
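The gini-index above can be estimated directly from labeled documents by counting, for each term, how its documents are spread over classes. A minimal Python sketch (the function name and toy data are illustrative):

```python
from collections import Counter, defaultdict

def gini_index(docs):
    """G(t) = sum over classes c of p(c|t)^2, estimated from labeled docs.

    docs: list of (terms, class_label) pairs -- a toy, illustrative format.
    """
    term_class = defaultdict(Counter)  # term -> class -> #docs containing term
    for terms, label in docs:
        for t in set(terms):  # count each term once per document
            term_class[t][label] += 1
    scores = {}
    for t, counts in term_class.items():
        n = sum(counts.values())
        scores[t] = sum((c / n) ** 2 for c in counts.values())
    return scores

docs = [
    (["ball", "goal"], "sports"),
    (["ball", "match"], "sports"),
    (["election", "vote"], "politics"),
    (["vote", "ball"], "politics"),
]
scores = gini_index(docs)
# "goal" occurs only in sports docs, so G("goal") = 1 (maximally discriminative);
# "ball" occurs in both classes, so G("ball") = (2/3)^2 + (1/3)^2 = 5/9
```

Terms would then be ranked by score and the top ones kept as features.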
Other methods
include_graphics("img/page 3.png")
Clustering: the process of grouping a set of objects into clusters of similar objects
Discover “natural structure” of data
Which one is not a text clustering task?
Please go to www.menti.com and use the code 9594 3321
Basic criteria
high intra-cluster similarity
low inter-cluster similarity
No (little) supervision signal about the underlying clustering structure
Need similarity/distance as guidance to form clusters
Organize document collections
Grouping search results
Organize documents by topics
Facilitate user browsing
Partitional clustering
Hierarchical clustering
Topic modeling
Hard clustering: each document belongs to exactly one cluster.
Soft clustering: a document can belong to more than one cluster.
Partitional clustering method: Construct a partition of \(n\) documents into a set of \(K\) clusters
Given: a set of documents and the number \(K\)
Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion
Typical partitional clustering algorithms
k-means clustering
Assumes documents are real-valued vectors.
Clusters based on centroids of points in a cluster, \(c\):
\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec x \in c}{\vec x}\]
Select \(K\) random docs \(\{s_1, s_2,… s_K\}\) as seeds.
Until clustering converges (or other stopping criterion):
For each doc \(d_i\): assign \(d_i\) to the cluster \(c_j\) whose seed \(s_j\) is closest, i.e. \(dist(d_i, s_j)\) is minimal
(Next, update the seeds to the centroid of each cluster)
For each cluster \(c_j\): \(s_j = \vec\mu(c_j)\)
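The steps above can be sketched in Python. This is a minimal toy implementation (Euclidean distance and a fixed iteration count stand in for a real convergence test; names are illustrative):

```python
import math
import random

def kmeans(vectors, k, iters=10, seed=0):
    """Minimal k-means sketch following the steps above (toy code).

    vectors: equal-length lists of reals (e.g. document vectors);
    Euclidean distance is assumed.
    """
    rng = random.Random(seed)
    seeds = [list(v) for v in rng.sample(vectors, k)]  # K random docs as seeds
    clusters = []
    for _ in range(iters):  # fixed count stands in for a convergence check
        clusters = [[] for _ in range(k)]
        for v in vectors:  # assign each doc to its closest seed
            j = min(range(k), key=lambda s: math.dist(v, seeds[s]))
            clusters[j].append(v)
        for j, c in enumerate(clusters):  # update seeds to cluster centroids
            if c:
                seeds[j] = [sum(xs) / len(c) for xs in zip(*c)]
    return seeds, clusters

vecs = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, parts = kmeans(vecs, 2)
# the two tight groups are recovered, with centroids at their means
```

In practice documents are sparse high-dimensional vectors and cosine similarity is often used instead of Euclidean distance, but the assign/update loop is the same.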
library(magrittr) # provides the %>% pipe
library(magick)   # image reading and animation
list.files(path = 'img/', pattern = 'page 26', full.names = TRUE) %>%
  image_read() %>%              # read each matching image file
  image_join() %>%              # join frames into one image stack
  image_animate(fps = .25) %>%  # animate; low fps steps slowly through frames
  image_write("img/kmeans.gif") # write the gif to disk
include_graphics("img/kmeans.gif")
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
include_graphics("img/page 33.png")
Typical hierarchical clustering algorithms
Bottom-up agglomerative clustering
include_graphics("img/page 34.png")
Typical hierarchical clustering algorithms
Top-down divisive clustering
Start with all data as one cluster
Repeatedly split one of the remaining clusters into two
include_graphics("img/page 35.png")
Start with each doc in a separate cluster
Repeatedly merge the closest pair of clusters, until only one cluster remains
The history of merging forms a binary tree or hierarchy.
Many variants for defining the closest pair of clusters (linkage methods): single-link, complete-link, centroid, and average-link.
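A minimal sketch of the bottom-up procedure with two common linkage choices, single-link (minimum pairwise distance) and complete-link (maximum pairwise distance); the function and toy data are illustrative, and the O(n^3) loop is for clarity, not efficiency:

```python
import math

def agglomerative(points, linkage="single"):
    """Bottom-up agglomerative clustering sketch (toy code).

    linkage: 'single' uses the minimum pairwise distance between
    clusters, 'complete' uses the maximum.
    """
    clusters = [[p] for p in points]  # each point starts as its own cluster
    merges = []  # merge history; this is what a dendrogram records
    agg = min if linkage == "single" else max
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

merges = agglomerative([(0, 0), (0, 1), (5, 5)])
# the first merge joins the two nearby points at distance 1
```

Cutting the recorded merge history at a chosen distance threshold yields the flat clustering described earlier.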
Feature Selection
Text Clustering
Evaluation